Repeat the analysis shown in the R lab of this chapter, but use TOTAL13 as the outcome variable. Please build both the regression model and the decision tree model (for regression). Identify the final models you would select, evaluate the models, and compare the regression model with the tree model.
library(RCurl)
## Loading required package: bitops
#### Example: Alzheimer's Disease
# filename
AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_498/master/resource/data/AD2.csv"))
AD$ID = c(1:dim(AD)[1])
str(AD)
## 'data.frame': 517 obs. of 18 variables:
## $ AGE : num 71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
## $ PTGENDER : int 2 1 2 1 1 2 2 2 1 2 ...
## $ PTEDUCAT : int 14 18 18 13 13 20 20 18 19 18 ...
## $ FDG : num 6.82 6.37 6.37 6.37 6.37 ...
## $ AV45 : num 1.11 1.11 1.11 1.11 1.11 ...
## $ HippoNV : num 0.529 0.538 0.269 0.576 0.601 ...
## $ e2_1 : int 1 0 0 0 1 0 0 0 0 1 ...
## $ e4_1 : int 0 0 1 0 0 1 1 1 1 1 ...
## $ rs3818361 : int 1 1 1 1 1 1 1 1 0 0 ...
## $ rs744373 : int 1 0 1 1 1 0 1 1 0 1 ...
## $ rs11136000: int 1 1 1 1 1 0 0 1 0 0 ...
## $ rs610932 : int 1 1 0 1 0 1 1 1 0 1 ...
## $ rs3851179 : int 1 0 1 0 0 1 0 0 1 0 ...
## $ rs3764650 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rs3865444 : int 0 1 1 0 0 0 1 1 1 0 ...
## $ MMSCORE : int 26 30 30 28 29 30 30 27 28 30 ...
## $ TOTAL13 : num 8 1.67 12 3 10 3.67 4 11 3 9 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# subset of variables we want in our first model that only uses demographics predictors
AD_demo <- subset(AD, select=c("TOTAL13", "AGE","PTGENDER","PTEDUCAT","ID"))
str(AD_demo)
## 'data.frame': 517 obs. of 5 variables:
## $ TOTAL13 : num 8 1.67 12 3 10 3.67 4 11 3 9 ...
## $ AGE : num 71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
## $ PTGENDER: int 2 1 2 1 1 2 2 2 1 2 ...
## $ PTEDUCAT: int 14 18 18 13 13 20 20 18 19 18 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# ggplot: Plot the scatterplot of the data
# install.packages("ggplot2")
library(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=4)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)
The scatter plot “TOTAL13 versus PTEDUCAT” shows a weak positive relationship between PTEDUCAT and TOTAL13.
# ggplot: Plot the scatterplot of the data
# install.packages("ggplot2")
library(ggplot2)
p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=4)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus AGE")
print(p)
The scatter plot “TOTAL13 versus AGE” shows a weak positive relationship between AGE and TOTAL13.
# install.packages("car")
library(car)
# fit a simple linear regression model with only AGE
lm.AD_demo <- lm(TOTAL13 ~ AGE, data = AD_demo)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.AD_demo)
##
## Call:
## lm(formula = TOTAL13 ~ AGE, data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.777 -5.598 -1.834 4.063 39.299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2456 3.5461 0.069 0.944819
## AGE 0.1887 0.0486 3.884 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.913 on 515 degrees of freedom
## Multiple R-squared: 0.02845, Adjusted R-squared: 0.02657
## F-statistic: 15.08 on 1 and 515 DF, p-value: 0.0001164
‘AGE’ is statistically significant with a p-value of 0.000116, so we reject the null hypothesis (H0: no relationship between TOTAL13 and AGE). However, R-squared is only 0.02845, indicating that AGE alone explains about 2.8% of the variability in TOTAL13.
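The 2.8% figure reflects an identity of simple linear regression: R-squared equals the squared Pearson correlation between the single predictor and the outcome. A minimal sketch of that identity on synthetic data (x and y below are illustrative, not the AD data):

```r
# In simple linear regression, summary()$r.squared equals cor(x, y)^2.
set.seed(1)                       # synthetic data for illustration
x <- rnorm(100)
y <- 0.2 * x + rnorm(100)
fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared
stopifnot(isTRUE(all.equal(r2, cor(x, y)^2)))  # the identity holds exactly
```

With the lab data, the same identity links `summary(lm.AD_demo)$r.squared` to `cor(AD_demo$AGE, AD_demo$TOTAL13)^2`.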
# fit the multiple linear regression model with more than one predictor
lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
##
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT, data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.691 -5.450 -1.972 4.124 38.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.52738 4.21805 1.073 0.28363
## AGE 0.16025 0.04867 3.292 0.00106 **
## PTGENDER 2.18848 0.71130 3.077 0.00220 **
## PTEDUCAT -0.34519 0.13026 -2.650 0.00830 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.825 on 513 degrees of freedom
## Multiple R-squared: 0.05359, Adjusted R-squared: 0.04806
## F-statistic: 9.684 on 3 and 513 DF, p-value: 3.164e-06
With all three demographic variables included in the model, the R-squared value increases to 0.05359, indicating that about 5.4% of the variability in TOTAL13 can be explained by the three variables. All three predictors are statistically significant, with p-values of 0.00106, 0.00220, and 0.00830, each less than 0.01.
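With several predictors in the model, it is also worth checking multicollinearity. The variance inflation factor (VIF) of a predictor is 1 / (1 - R²) from regressing it on the other predictors; `car::vif()` (the car package is loaded above) computes this directly. A sketch of the idea on synthetic data (x1, x2, x3 are illustrative, not the AD variables):

```r
set.seed(6)                         # synthetic data for illustration
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)     # strongly correlated with x1
x3 <- rnorm(100)                    # roughly independent of x1 and x2
# VIF of a predictor = 1 / (1 - R^2) from regressing it on the others
vif_manual <- function(x, others) {
  1 / (1 - summary(lm(x ~ ., data = others))$r.squared)
}
v1 <- vif_manual(x1, data.frame(x2, x3))
v3 <- vif_manual(x3, data.frame(x1, x2))
print(c(v1 = v1, v3 = v3))          # v1 is inflated; v3 stays near 1
```

With the lab data, `car::vif(lm.AD_demo2)` gives the same quantities in one call.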
# How to detect interaction terms by exploratory data analysis (EDA)
require(ggplot2)
p <- ggplot(AD_demo, aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(aes(colour=AGE), size=2)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus PTEDUCAT")
print(p)
To see whether the relationship between TOTAL13 and PTEDUCAT changes with AGE, we can examine the same scatterplot at two levels of AGE.
p <- ggplot(AD_demo[which(AD_demo$AGE < 60),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE < 60")
print(p)
p <- ggplot(AD_demo[which(AD_demo$AGE > 80),], aes(x = PTEDUCAT, y = TOTAL13))
p <- p + geom_point(size=2)
p <- p + geom_smooth(method = lm)
p <- p + labs(title="TOTAL13 versus PTEDUCAT when AGE > 80")
print(p)
TOTAL13 shows a very weak positive correlation with PTEDUCAT when AGE < 60. On the other hand, TOTAL13 is negatively correlated with PTEDUCAT for patients older than 80.
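The visual impression can be quantified by fitting the slope of the outcome on education separately within each age group. A sketch on synthetic data where the slope truly flips sign with age (with the lab data, one would subset AD_demo on AGE exactly as in the plots above):

```r
set.seed(7)                                   # synthetic data for illustration
d <- data.frame(age = runif(200, 55, 90), educ = rnorm(200, 16, 3))
d$y <- (70 - d$age) * 0.05 * d$educ + rnorm(200, sd = 2)  # slope flips at age 70
young <- lm(y ~ educ, data = d[d$age < 60, ])
old   <- lm(y ~ educ, data = d[d$age > 80, ])
# positive slope in the younger group, negative in the older group
print(c(young = unname(coef(young)["educ"]),
        old   = unname(coef(old)["educ"])))
```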
p <- ggplot(AD_demo, aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus Age")
print(p)
# re-draw the scatterplot for male
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 1),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus Age - male")
print(p)
# re-draw the scatterplot for female
p <- ggplot(AD_demo[which(AD_demo$PTGENDER == 2),], aes(x = AGE, y = TOTAL13))
p <- p + geom_point(size=2)
# p <- p + geom_smooth(method = "auto")
p <- p + labs(title="TOTAL13 versus Age - female")
print(p)
The scatter plot between AGE and TOTAL13 for males shows a more spread-out pattern than the one for females.
# fit the multiple linear regression model with an interaction term: AGE*PTEDUCAT
lm.AD_demo2 <- lm(TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE*PTEDUCAT, data = AD_demo)
summary(lm.AD_demo2)
##
## Call:
## lm(formula = TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + AGE * PTEDUCAT,
## data = AD_demo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.259 -5.233 -1.888 3.890 38.180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.61428 20.96925 -0.935 0.35003
## AGE 0.48724 0.28244 1.725 0.08511 .
## PTGENDER 2.25736 0.71344 3.164 0.00165 **
## PTEDUCAT 1.15343 1.28174 0.900 0.36860
## AGE:PTEDUCAT -0.02042 0.01737 -1.175 0.24042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.822 on 512 degrees of freedom
## Multiple R-squared: 0.05614, Adjusted R-squared: 0.04877
## F-statistic: 7.613 on 4 and 512 DF, p-value: 5.795e-06
Including the interaction term increases the R-squared value from 0.05359 (for the model without the interaction term) to 0.05614. However, the interaction term is not statistically significant: its p-value of 0.24 is greater than 0.05, so we cannot reject the null hypothesis that its coefficient is zero.
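The same conclusion can be reached with a partial F-test, comparing the nested models with and without the interaction term via anova(). A sketch on synthetic data with no true interaction (with the lab data, compare the AGE*PTEDUCAT model against the additive model of the previous section):

```r
set.seed(2)                          # synthetic data for illustration
d <- data.frame(age = rnorm(200, 75, 7), educ = rnorm(200, 16, 3))
d$y <- 0.2 * d$age - 0.3 * d$educ + rnorm(200, sd = 5)   # no true interaction
m0 <- lm(y ~ age + educ, data = d)             # additive model
m1 <- lm(y ~ age + educ + age:educ, data = d)  # adds the interaction term
print(anova(m0, m1))   # the Pr(>F) row tests the interaction's contribution
```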
# Conduct diagnostics of the model
# install.packages("ggfortify")
require("ggfortify")
## Loading required package: ggfortify
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'quiet'
autoplot(lm.AD_demo2, which = 1:6, ncol = 3, label.size = 3)
The residuals-vs-fitted plot is used to detect non-linearity, unequal error variances, and outliers. Our plot shows no significant pattern, indicating a reasonable fit: the residuals are clustered around the fitted line. The scale-location and standardized-residual plots show no significant patterns either.
The normal Q-Q plot checks whether the data follow a theoretical distribution (e.g., normal or exponential). Our plot is not perfectly straight, suggesting room to improve the model.
We tried adding more predictors to improve the model:
# fit a full-scale model
AD_full <- AD[,c(1:17)]
lm.AD <- lm(TOTAL13 ~ ., data = AD_full)
# anova(lm.AD, type=3)
summary(lm.AD)
##
## Call:
## lm(formula = TOTAL13 ~ ., data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4871 -3.5459 -0.2206 3.2142 16.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.69987 5.77875 14.484 < 2e-16 ***
## AGE -0.01408 0.03685 -0.382 0.702677
## PTGENDER 0.54352 0.50309 1.080 0.280502
## PTEDUCAT -0.07920 0.09323 -0.850 0.396007
## FDG -2.47515 0.42761 -5.788 1.26e-08 ***
## AV45 4.49975 1.33614 3.368 0.000816 ***
## HippoNV -24.23077 3.83974 -6.311 6.13e-10 ***
## e2_1 -0.38772 0.84406 -0.459 0.646187
## e4_1 -0.35505 0.54318 -0.654 0.513631
## rs3818361 -0.30218 0.50394 -0.600 0.549017
## rs744373 -0.68404 0.48059 -1.423 0.155263
## rs11136000 -0.37215 0.50482 -0.737 0.461349
## rs610932 0.56069 0.50049 1.120 0.263137
## rs3851179 -0.24509 0.48188 -0.509 0.611254
## rs3764650 1.17394 0.60683 1.935 0.053609 .
## rs3865444 0.29680 0.47860 0.620 0.535449
## MMSCORE -1.65084 0.13733 -12.021 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.36 on 500 degrees of freedom
## Multiple R-squared: 0.5673, Adjusted R-squared: 0.5535
## F-statistic: 40.97 on 16 and 500 DF, p-value: < 2.2e-16
To predict TOTAL13, a full model was built with all the demographic, genetic, and imaging variables. FDG, AV45, HippoNV, and MMSCORE are significant based on their p-values. R-squared increases to 0.5673, indicating that about 57% of the variability in TOTAL13 can be explained by these variables.
# Do we need all the variables?
# remove age, as it is least significant with p-value of 0.702677
lm.AD.reduced <- lm.AD;
lm.AD.reduced <- update(lm.AD.reduced, ~ . - AGE);
summary(lm.AD.reduced);
##
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444 + MMSCORE, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.565 -3.548 -0.245 3.215 16.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.49682 4.84064 17.043 < 2e-16 ***
## PTGENDER 0.53185 0.50174 1.060 0.289644
## PTEDUCAT -0.07461 0.09237 -0.808 0.419613
## FDG -2.47651 0.42724 -5.797 1.20e-08 ***
## AV45 4.45032 1.32872 3.349 0.000871 ***
## HippoNV -23.70644 3.58287 -6.617 9.46e-11 ***
## e2_1 -0.38256 0.84324 -0.454 0.650254
## e4_1 -0.31301 0.53145 -0.589 0.556142
## rs3818361 -0.29665 0.50330 -0.589 0.555852
## rs744373 -0.68368 0.48018 -1.424 0.155122
## rs11136000 -0.37859 0.50411 -0.751 0.453002
## rs610932 0.56613 0.49986 1.133 0.257941
## rs3851179 -0.24467 0.48147 -0.508 0.611548
## rs3764650 1.17249 0.60630 1.934 0.053694 .
## rs3865444 0.29740 0.47819 0.622 0.534271
## MMSCORE -1.65351 0.13704 -12.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.355 on 501 degrees of freedom
## Multiple R-squared: 0.5672, Adjusted R-squared: 0.5542
## F-statistic: 43.77 on 15 and 501 DF, p-value: < 2.2e-16
Removing AGE barely changes the R-squared value (from 0.5673 to 0.5672). We can use an F-test to compare the full model with this reduced model by applying the anova() function.
anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 +
## e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444 + MMSCORE
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444 + MMSCORE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 501 14367
## 2 500 14363 1 4.1901 0.1459 0.7027
By the F-test, the p-value is 0.7027, indicating that the two models are statistically indistinguishable. We then removed e2_1, now the least significant predictor.
# further remove e2_1, as it is now the least significant
lm.AD.reduced <- update(lm.AD.reduced, ~ . - e2_1);
summary(lm.AD.reduced);
##
## Call:
## lm(formula = TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444 + MMSCORE, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5468 -3.5931 -0.2865 3.2054 16.5573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.39428 4.83153 17.053 < 2e-16 ***
## PTGENDER 0.52655 0.50120 1.051 0.293959
## PTEDUCAT -0.07449 0.09230 -0.807 0.420012
## FDG -2.47157 0.42676 -5.792 1.23e-08 ***
## AV45 4.47763 1.32631 3.376 0.000792 ***
## HippoNV -23.78408 3.57595 -6.651 7.62e-11 ***
## e4_1 -0.27389 0.52399 -0.523 0.601422
## rs3818361 -0.28798 0.50254 -0.573 0.566862
## rs744373 -0.67756 0.47961 -1.413 0.158349
## rs11136000 -0.38080 0.50369 -0.756 0.449992
## rs610932 0.57090 0.49936 1.143 0.253472
## rs3851179 -0.22076 0.47820 -0.462 0.644534
## rs3764650 1.17620 0.60577 1.942 0.052737 .
## rs3865444 0.29636 0.47780 0.620 0.535363
## MMSCORE -1.65314 0.13692 -12.073 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.351 on 502 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5549
## F-statistic: 46.95 on 14 and 502 DF, p-value: < 2.2e-16
anova(lm.AD.reduced,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e4_1 +
## rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444 + MMSCORE
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444 + MMSCORE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 502 14373
## 2 500 14363 2 10.092 0.1757 0.8389
We can repeat this process until no more variables can be removed, or use the step() function to automate it.
# Automatic model selection
lm.AD.F <- step(lm.AD, direction="backward", test="F")
## Start: AIC=1752.68
## TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - AGE 1 4.2 14367 1750.8 0.1459 0.7026767
## - e2_1 1 6.1 14369 1750.9 0.2110 0.6461870
## - rs3851179 1 7.4 14370 1751.0 0.2587 0.6112542
## - rs3818361 1 10.3 14373 1751.0 0.3596 0.5490167
## - rs3865444 1 11.0 14374 1751.1 0.3846 0.5354486
## - e4_1 1 12.3 14375 1751.1 0.4273 0.5136311
## - rs11136000 1 15.6 14378 1751.2 0.5435 0.4613491
## - PTEDUCAT 1 20.7 14383 1751.4 0.7217 0.3960065
## - PTGENDER 1 33.5 14396 1751.9 1.1672 0.2805017
## - rs610932 1 36.1 14399 1752.0 1.2550 0.2631368
## <none> 14363 1752.7
## - rs744373 1 58.2 14421 1752.8 2.0259 0.1552625
## - rs3764650 1 107.5 14470 1754.5 3.7425 0.0536092 .
## - AV45 1 325.8 14688 1762.3 11.3416 0.0008163 ***
## - FDG 1 962.4 15325 1784.2 33.5041 1.257e-08 ***
## - HippoNV 1 1143.9 15506 1790.3 39.8227 6.134e-10 ***
## - MMSCORE 1 4150.9 18514 1881.9 144.5039 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1750.83
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e2_1 +
## e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - e2_1 1 5.9 14373 1749.0 0.2058 0.6502543
## - rs3851179 1 7.4 14374 1749.1 0.2583 0.6115481
## - e4_1 1 9.9 14377 1749.2 0.3469 0.5561416
## - rs3818361 1 10.0 14377 1749.2 0.3474 0.5558518
## - rs3865444 1 11.1 14378 1749.2 0.3868 0.5342708
## - rs11136000 1 16.2 14383 1749.4 0.5640 0.4530021
## - PTEDUCAT 1 18.7 14386 1749.5 0.6525 0.4196129
## - PTGENDER 1 32.2 14399 1750.0 1.1237 0.2896443
## - rs610932 1 36.8 14404 1750.2 1.2827 0.2579413
## <none> 14367 1750.8
## - rs744373 1 58.1 14425 1750.9 2.0273 0.1551222
## - rs3764650 1 107.2 14474 1752.7 3.7398 0.0536944 .
## - AV45 1 321.7 14688 1760.3 11.2180 0.0008711 ***
## - FDG 1 963.5 15330 1782.4 33.6006 1.198e-08 ***
## - HippoNV 1 1255.4 15622 1792.2 43.7795 9.461e-11 ***
## - MMSCORE 1 4175.1 18542 1880.7 145.5957 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1749.05
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e4_1 +
## rs3818361 + rs744373 + rs11136000 + rs610932 + rs3851179 +
## rs3764650 + rs3865444 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3851179 1 6.1 14379 1747.3 0.2131 0.6445342
## - e4_1 1 7.8 14380 1747.3 0.2732 0.6014218
## - rs3818361 1 9.4 14382 1747.4 0.3284 0.5668623
## - rs3865444 1 11.0 14384 1747.4 0.3847 0.5353627
## - rs11136000 1 16.4 14389 1747.6 0.5716 0.4499918
## - PTEDUCAT 1 18.6 14391 1747.7 0.6514 0.4200115
## - PTGENDER 1 31.6 14404 1748.2 1.1037 0.2939585
## - rs610932 1 37.4 14410 1748.4 1.3071 0.2534718
## <none> 14373 1749.0
## - rs744373 1 57.1 14430 1749.1 1.9959 0.1583492
## - rs3764650 1 107.9 14481 1750.9 3.7701 0.0527365 .
## - AV45 1 326.3 14699 1758.7 11.3975 0.0007925 ***
## - FDG 1 960.3 15333 1780.5 33.5415 1.232e-08 ***
## - HippoNV 1 1266.6 15639 1790.7 44.2374 7.617e-11 ***
## - MMSCORE 1 4173.4 18546 1878.8 145.7671 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1747.27
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + e4_1 +
## rs3818361 + rs744373 + rs11136000 + rs610932 + rs3764650 +
## rs3865444 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - e4_1 1 7.8 14387 1745.5 0.2730 0.6015868
## - rs3818361 1 8.4 14387 1745.6 0.2926 0.5888193
## - rs3865444 1 10.7 14390 1745.7 0.3748 0.5406827
## - rs11136000 1 15.4 14394 1745.8 0.5370 0.4640194
## - PTEDUCAT 1 18.1 14397 1745.9 0.6324 0.4268697
## - PTGENDER 1 31.6 14410 1746.4 1.1062 0.2934183
## - rs610932 1 38.8 14418 1746.7 1.3572 0.2445722
## - rs744373 1 55.2 14434 1747.2 1.9325 0.1650984
## <none> 14379 1747.3
## - rs3764650 1 106.6 14485 1749.1 3.7276 0.0540834 .
## - AV45 1 331.3 14710 1757.0 11.5910 0.0007157 ***
## - FDG 1 955.0 15334 1778.5 33.4071 1.313e-08 ***
## - HippoNV 1 1273.7 15652 1789.2 44.5556 6.548e-11 ***
## - MMSCORE 1 4180.1 18559 1877.2 146.2279 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1745.55
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs3818361 +
## rs744373 + rs11136000 + rs610932 + rs3764650 + rs3865444 +
## MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3818361 1 9.7 14396 1743.9 0.3410 0.5595089
## - rs3865444 1 10.3 14397 1743.9 0.3605 0.5484712
## - rs11136000 1 15.8 14402 1744.1 0.5547 0.4567614
## - PTEDUCAT 1 19.4 14406 1744.2 0.6783 0.4105621
## - PTGENDER 1 32.8 14419 1744.7 1.1499 0.2840808
## - rs610932 1 38.6 14425 1744.9 1.3509 0.2456688
## <none> 14387 1745.5
## - rs744373 1 56.7 14443 1745.6 1.9877 0.1591935
## - rs3764650 1 105.7 14492 1747.3 3.7034 0.0548639 .
## - AV45 1 329.9 14716 1755.3 11.5565 0.0007287 ***
## - FDG 1 948.2 15335 1776.5 33.2180 1.436e-08 ***
## - HippoNV 1 1280.7 15667 1787.6 44.8679 5.646e-11 ***
## - MMSCORE 1 4173.2 18560 1875.2 146.1992 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1743.9
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs744373 +
## rs11136000 + rs610932 + rs3764650 + rs3865444 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs3865444 1 11.0 14407 1742.3 0.3866 0.5343679
## - rs11136000 1 13.3 14410 1742.4 0.4680 0.4942429
## - PTEDUCAT 1 17.7 14414 1742.5 0.6200 0.4314228
## - PTGENDER 1 32.2 14428 1743.0 1.1310 0.2880657
## - rs610932 1 37.1 14433 1743.2 1.3006 0.2546522
## <none> 14396 1743.9
## - rs744373 1 57.7 14454 1744.0 2.0244 0.1554028
## - rs3764650 1 108.2 14504 1745.8 3.7959 0.0519318 .
## - AV45 1 327.2 14724 1753.5 11.4776 0.0007594 ***
## - FDG 1 952.0 15348 1775.0 33.3941 1.318e-08 ***
## - HippoNV 1 1273.1 15669 1785.7 44.6574 6.220e-11 ***
## - MMSCORE 1 4199.5 18596 1874.2 147.3113 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1742.29
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs744373 +
## rs11136000 + rs610932 + rs3764650 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs11136000 1 12.6 14420 1740.7 0.4426 0.5061842
## - PTEDUCAT 1 17.9 14425 1740.9 0.6303 0.4276172
## - PTGENDER 1 35.1 14442 1741.5 1.2314 0.2676680
## - rs610932 1 35.9 14443 1741.6 1.2619 0.2618317
## <none> 14407 1742.3
## - rs744373 1 60.5 14468 1742.5 2.1237 0.1456571
## - rs3764650 1 110.2 14518 1744.2 3.8694 0.0497191 *
## - AV45 1 320.3 14728 1751.7 11.2507 0.0008556 ***
## - FDG 1 947.6 15355 1773.2 33.2813 1.390e-08 ***
## - HippoNV 1 1264.0 15671 1783.8 44.3947 7.024e-11 ***
## - MMSCORE 1 4304.8 18712 1875.5 151.1895 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1740.74
## TOTAL13 ~ PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV + rs744373 +
## rs610932 + rs3764650 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - PTEDUCAT 1 17.5 14437 1739.4 0.6153 0.433183
## - PTGENDER 1 34.5 14454 1740.0 1.2129 0.271274
## - rs610932 1 36.7 14457 1740.1 1.2902 0.256557
## <none> 14420 1740.7
## - rs744373 1 60.7 14481 1740.9 2.1338 0.144705
## - rs3764650 1 110.3 14530 1742.7 3.8795 0.049422 *
## - AV45 1 340.6 14760 1750.8 11.9738 0.000585 ***
## - FDG 1 943.0 15363 1771.5 33.1547 1.476e-08 ***
## - HippoNV 1 1267.4 15687 1782.3 44.5607 6.485e-11 ***
## - MMSCORE 1 4298.0 18718 1873.6 151.1148 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1739.37
## TOTAL13 ~ PTGENDER + FDG + AV45 + HippoNV + rs744373 + rs610932 +
## rs3764650 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - PTGENDER 1 26.5 14464 1738.3 0.9321 0.334783
## - rs610932 1 33.4 14471 1738.6 1.1741 0.279080
## <none> 14437 1739.4
## - rs744373 1 56.8 14494 1739.4 1.9984 0.158081
## - rs3764650 1 107.5 14545 1741.2 3.7811 0.052385 .
## - AV45 1 336.6 14774 1749.3 11.8445 0.000626 ***
## - FDG 1 945.5 15383 1770.2 33.2700 1.395e-08 ***
## - HippoNV 1 1250.1 15688 1780.3 43.9875 8.474e-11 ***
## - MMSCORE 1 4671.6 19109 1882.3 164.3755 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1738.32
## TOTAL13 ~ FDG + AV45 + HippoNV + rs744373 + rs610932 + rs3764650 +
## MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## - rs610932 1 30.5 14494 1737.4 1.0749 0.3003223
## <none> 14464 1738.3
## - rs744373 1 59.5 14523 1738.4 2.0937 0.1485233
## - rs3764650 1 110.2 14574 1740.2 3.8768 0.0494984 *
## - AV45 1 317.7 14782 1747.5 11.1806 0.0008875 ***
## - FDG 1 979.6 15444 1770.2 34.4741 7.800e-09 ***
## - HippoNV 1 1383.4 15847 1783.5 48.6817 9.413e-12 ***
## - MMSCORE 1 4666.8 19131 1880.9 164.2303 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=1737.41
## TOTAL13 ~ FDG + AV45 + HippoNV + rs744373 + rs3764650 + MMSCORE
##
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 14494 1737.4
## - rs744373 1 58.4 14553 1737.5 2.0545 0.152372
## - rs3764650 1 119.4 14614 1739.7 4.2007 0.040918 *
## - AV45 1 310.4 14805 1746.4 10.9203 0.001018 **
## - FDG 1 989.7 15484 1769.6 34.8228 6.589e-09 ***
## - HippoNV 1 1369.3 15864 1782.1 48.1792 1.187e-11 ***
## - MMSCORE 1 4758.8 19253 1882.2 167.4415 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.AD.F)
##
## Call:
## lm(formula = TOTAL13 ~ FDG + AV45 + HippoNV + rs744373 + rs3764650 +
## MMSCORE, data = AD_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8420 -3.5542 -0.3011 3.1265 17.1034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.6925 4.3758 19.126 < 2e-16 ***
## FDG -2.4749 0.4194 -5.901 6.59e-09 ***
## AV45 4.0970 1.2398 3.305 0.00102 **
## HippoNV -23.9726 3.4537 -6.941 1.19e-11 ***
## rs744373 -0.6805 0.4748 -1.433 0.15237
## rs3764650 1.2307 0.6005 2.050 0.04092 *
## MMSCORE -1.6961 0.1311 -12.940 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.331 on 510 degrees of freedom
## Multiple R-squared: 0.5633, Adjusted R-squared: 0.5582
## F-statistic: 109.7 on 6 and 510 DF, p-value: < 2.2e-16
anova(lm.AD.F ,lm.AD)
## Analysis of Variance Table
##
## Model 1: TOTAL13 ~ FDG + AV45 + HippoNV + rs744373 + rs3764650 + MMSCORE
## Model 2: TOTAL13 ~ AGE + PTGENDER + PTEDUCAT + FDG + AV45 + HippoNV +
## e2_1 + e4_1 + rs3818361 + rs744373 + rs11136000 + rs610932 +
## rs3851179 + rs3764650 + rs3865444 + MMSCORE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 510 14494
## 2 500 14363 10 131.89 0.4591 0.9158
With the final 6 predictors, the R-squared value is 0.5633, not far from the 0.5673 achieved by the model using all 16 predictors. By the F-test via the anova() function, the two models are statistically indistinguishable, with a p-value of 0.9158.
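Since backward selection minimizes AIC at each step, candidate models can also be compared directly with AIC(), where lower is better. A sketch on synthetic data (with the lab data, `AIC(lm.AD.F, lm.AD)` compares the 6-predictor and 16-predictor models the same way):

```r
set.seed(3)                     # synthetic data for illustration
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)         # only x1 carries signal
full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)
print(AIC(full, reduced))       # AIC penalizes the extra useless parameters
```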
# Conduct diagnostics of the model
# install.packages("ggfortify")
library("ggfortify")
autoplot(lm.AD.F, which = 1:6, ncol = 3, label.size = 3)
The residuals-vs-fitted and scale-location plots still show no significant patterns, indicating that the linearity assumption is not significantly violated. The normal Q-Q plot for the new model improved considerably over the earlier model with only demographic predictors.
Therefore, we selected the model with the 6 predictors (FDG, AV45, HippoNV, rs744373, rs3764650, MMSCORE) as the final model.
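Beyond in-sample R-squared, the final model can be evaluated on held-out data via a train/test split and RMSE; the same metric later allows a direct comparison between the regression model and the tree model. A sketch on synthetic data (with the lab data, replace `d` by AD_full and the formula by the 6-predictor final model):

```r
set.seed(4)                                   # synthetic data for illustration
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)
train_idx <- sample(nrow(d), 0.7 * nrow(d))   # 70/30 split
fit  <- lm(y ~ x, data = d[train_idx, ])
pred <- predict(fit, newdata = d[-train_idx, ])
rmse <- sqrt(mean((d$y[-train_idx] - pred)^2))
print(rmse)                                   # out-of-sample RMSE
```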
# Evaluate the variable importance by all subsets regression
# install.packages("leaps")
library(leaps)
leaps<-regsubsets(TOTAL13 ~ ., data = AD_full,nbest=4)
# view results
summary(leaps)
## Subset selection object
## Call: regsubsets.formula(TOTAL13 ~ ., data = AD_full, nbest = 4)
## 16 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## PTGENDER FALSE FALSE
## PTEDUCAT FALSE FALSE
## FDG FALSE FALSE
## AV45 FALSE FALSE
## HippoNV FALSE FALSE
## e2_1 FALSE FALSE
## e4_1 FALSE FALSE
## rs3818361 FALSE FALSE
## rs744373 FALSE FALSE
## rs11136000 FALSE FALSE
## rs610932 FALSE FALSE
## rs3851179 FALSE FALSE
## rs3764650 FALSE FALSE
## rs3865444 FALSE FALSE
## MMSCORE FALSE FALSE
## 4 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AGE PTGENDER PTEDUCAT FDG AV45 HippoNV e2_1 e4_1 rs3818361
## 1 ( 1 ) " " " " " " " " " " " " " " " " " "
## 1 ( 2 ) " " " " " " "*" " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " "*" " " " " " "
## 1 ( 4 ) " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " "
## 2 ( 2 ) " " " " " " "*" " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " "*" " " " " " " " "
## 2 ( 4 ) "*" " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " " " " "
## 3 ( 2 ) " " " " " " " " "*" "*" " " " " " "
## 3 ( 3 ) " " " " " " "*" "*" " " " " " " " "
## 3 ( 4 ) " " " " " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " "*" "*" "*" " " " " " "
## 4 ( 2 ) " " " " " " "*" " " "*" " " " " " "
## 4 ( 3 ) " " " " " " "*" " " "*" " " " " " "
## 4 ( 4 ) " " " " " " "*" " " "*" " " " " " "
## 5 ( 1 ) " " " " " " "*" "*" "*" " " " " " "
## 5 ( 2 ) " " " " " " "*" "*" "*" " " " " " "
## 5 ( 3 ) " " " " " " "*" "*" "*" " " " " " "
## 5 ( 4 ) " " "*" " " "*" "*" "*" " " " " " "
## 6 ( 1 ) " " " " " " "*" "*" "*" " " " " " "
## 6 ( 2 ) " " " " " " "*" "*" "*" " " " " " "
## 6 ( 3 ) " " "*" " " "*" "*" "*" " " " " " "
## 6 ( 4 ) " " " " " " "*" "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" "*" " " " " " "
## 7 ( 2 ) " " "*" " " "*" "*" "*" " " " " " "
## 7 ( 3 ) " " " " " " "*" "*" "*" " " " " " "
## 7 ( 4 ) " " " " " " "*" "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " " " " "
## 8 ( 2 ) " " " " " " "*" "*" "*" " " " " " "
## 8 ( 3 ) " " " " " " "*" "*" "*" " " " " " "
## 8 ( 4 ) " " " " " " "*" "*" "*" " " "*" " "
## rs744373 rs11136000 rs610932 rs3851179 rs3764650 rs3865444
## 1 ( 1 ) " " " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " " " " "
## 4 ( 2 ) " " " " " " " " "*" " "
## 4 ( 3 ) "*" " " " " " " " " " "
## 4 ( 4 ) " " "*" " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " "
## 5 ( 2 ) "*" " " " " " " " " " "
## 5 ( 3 ) " " " " "*" " " " " " "
## 5 ( 4 ) " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " "*" " "
## 6 ( 2 ) " " " " "*" " " "*" " "
## 6 ( 3 ) " " " " " " " " "*" " "
## 6 ( 4 ) " " " " " " " " "*" "*"
## 7 ( 1 ) "*" " " "*" " " "*" " "
## 7 ( 2 ) "*" " " " " " " "*" " "
## 7 ( 3 ) "*" "*" " " " " "*" " "
## 7 ( 4 ) "*" " " " " " " "*" "*"
## 8 ( 1 ) "*" " " "*" " " "*" " "
## 8 ( 2 ) "*" " " "*" " " "*" "*"
## 8 ( 3 ) "*" "*" "*" " " "*" " "
## 8 ( 4 ) "*" " " "*" " " "*" " "
## MMSCORE
## 1 ( 1 ) "*"
## 1 ( 2 ) " "
## 1 ( 3 ) " "
## 1 ( 4 ) " "
## 2 ( 1 ) "*"
## 2 ( 2 ) "*"
## 2 ( 3 ) "*"
## 2 ( 4 ) "*"
## 3 ( 1 ) "*"
## 3 ( 2 ) "*"
## 3 ( 3 ) "*"
## 3 ( 4 ) "*"
## 4 ( 1 ) "*"
## 4 ( 2 ) "*"
## 4 ( 3 ) "*"
## 4 ( 4 ) "*"
## 5 ( 1 ) "*"
## 5 ( 2 ) "*"
## 5 ( 3 ) "*"
## 5 ( 4 ) "*"
## 6 ( 1 ) "*"
## 6 ( 2 ) "*"
## 6 ( 3 ) "*"
## 6 ( 4 ) "*"
## 7 ( 1 ) "*"
## 7 ( 2 ) "*"
## 7 ( 3 ) "*"
## 7 ( 4 ) "*"
## 8 ( 1 ) "*"
## 8 ( 2 ) "*"
## 8 ( 3 ) "*"
## 8 ( 4 ) "*"
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="r2")
The regsubsets() function shows which models achieve the highest R-squared value and which variables appear most frequently in them. From the plot, the highest R-squared is achieved by the model with 8 predictors: PTGENDER, FDG, AV45, HippoNV, rs744373, rs610932, rs3764650, and MMSCORE. Six of these 8 predictors also appear in the model we chose as the final model.
Passing the significance tests and fitting the data well only mean that there is no significant evidence against the model; they do not mean the model is the causal model, and other models may fit the data equally well.
Assumptions of the estimation:
1. The estimates of the regression parameters are independent (their correlations are zero).
2. The variances of the regression parameters are the same.
Interactions between predictors can contribute extra prediction power. Both contextual knowledge and meticulous analytic craftwork are needed to recognize the important interaction terms to include in the model.
Scatter plots can be used to visualize the relationship between any variable and the outcome. They also provide insight into how that relationship changes according to another variable.
For continuous predictors:
# Supplement the model with some visualization of the statistical patterns
# Scatterplot matrix to visualize the relationship between outcome variable with continuous predictors
library(ggplot2)
# install.packages("GGally")
library(GGally)
# draw the scatterplots and also empirical shapes of the distributions of the variables
p <- ggpairs(AD[, c(17, 1, 3, 4, 5, 6)],
             upper = list(continuous = "points"),
             lower = list(continuous = "cor"))
print(p)
For categorical predictors:
# Boxplot to visualize the relationship between outcome variable with categorical predictors
library(ggplot2)
qplot(factor(PTGENDER), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(PTGENDER))
qplot(factor(rs3818361), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs3818361))
qplot(factor(rs11136000), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs11136000))
qplot(factor(rs744373), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs744373))
qplot(factor(rs610932), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs610932))
qplot(factor(rs3865444), TOTAL13, data = AD,
geom=c("boxplot"), fill = factor(rs3865444))
# Histogram to visualize the relationship between outcome variable with categorical predictors
library(ggplot2)
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(PTGENDER))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs3818361))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs11136000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs744373))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs610932))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TOTAL13, data = AD, geom = "histogram",
fill = factor(rs3865444))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Using all variables:
library(rpart)
library(rpart.plot)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(ggplot2)
library(partykit)
## Loading required package: grid
library(RCurl)
#### Example: Alzheimer's Disease
# filename
AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_498/master/resource/data/AD2.csv"))
AD$ID = c(1:dim(AD)[1])
str(AD)
## 'data.frame': 517 obs. of 18 variables:
## $ AGE : num 71.7 77.7 72.8 69.6 70.9 65.1 79.6 73.6 60.7 70.6 ...
## $ PTGENDER : int 2 1 2 1 1 2 2 2 1 2 ...
## $ PTEDUCAT : int 14 18 18 13 13 20 20 18 19 18 ...
## $ FDG : num 6.82 6.37 6.37 6.37 6.37 ...
## $ AV45 : num 1.11 1.11 1.11 1.11 1.11 ...
## $ HippoNV : num 0.529 0.538 0.269 0.576 0.601 ...
## $ e2_1 : int 1 0 0 0 1 0 0 0 0 1 ...
## $ e4_1 : int 0 0 1 0 0 1 1 1 1 1 ...
## $ rs3818361 : int 1 1 1 1 1 1 1 1 0 0 ...
## $ rs744373 : int 1 0 1 1 1 0 1 1 0 1 ...
## $ rs11136000: int 1 1 1 1 1 0 0 1 0 0 ...
## $ rs610932 : int 1 1 0 1 0 1 1 1 0 1 ...
## $ rs3851179 : int 1 0 1 0 0 1 0 0 1 0 ...
## $ rs3764650 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rs3865444 : int 0 1 1 0 0 0 1 1 1 0 ...
## $ MMSCORE : int 26 30 30 28 29 30 30 27 28 30 ...
## $ TOTAL13 : num 8 1.67 12 3 10 3.67 4 11 3 9 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
theme_set(theme_gray(base_size = 15))
data <- AD
tree <- rpart(TOTAL13 ~ ., data)
prp(tree, nn.cex = 1)
Because TOTAL13 is a continuous variable, a regression tree is built. For a continuous variable, its values are first sorted, and the average of each pair of consecutive values is used as a candidate splitting value.
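The candidate-split rule just described can be sketched in a few lines; candidate_splits is a hypothetical helper of our own, not part of rpart:

```r
# Sort the distinct values of a continuous variable and take the
# midpoint of each consecutive pair as a candidate splitting value
candidate_splits <- function(x) {
  v <- sort(unique(x))
  (head(v, -1) + tail(v, -1)) / 2
}

candidate_splits(c(3, 1, 2, 2, 5))   # -> 1.5 2.5 4.0
```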
print(tree$variable.importance)
## MMSCORE FDG HippoNV AV45 ID AGE
## 12993.9095 6406.7705 5735.2366 2473.9718 1317.8796 1205.4757
## e4_1 PTEDUCAT rs3865444 rs3818361 e2_1
## 1175.9144 830.9080 395.6805 171.4647 137.1717
MMSCORE has the largest importance score among all variables.
The tree can be further pruned with the prune() function, where the parameter cp controls model complexity. cp is the minimum relative error improvement required to split a node, so a larger cp leads to a less complex tree. First we try cp = 0.05 and then cp = 0.1.
tree_0.05 <- prune(tree, cp = 0.05)
prp(tree_0.05, nn.cex = 1)
tree_0.1 <- prune(tree, cp = 0.1)
prp(tree_0.1, nn.cex = 1)
cp can be used to control model complexity in pruning, and can be chosen by minimizing the cross-validation error. For each tree, the number of decision nodes is recorded and used to measure the complexity of the tree.
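A sketch of this cp selection (on the built-in mtcars data rather than AD, to keep it self-contained): rpart records the cross-validated error for each candidate cp in cptable, so we can prune at the cp that minimizes xerror.

```r
library(rpart)

set.seed(1)  # xerror comes from cross-validation, so fix the seed
tree.full <- rpart(mpg ~ ., data = mtcars,
                   control = rpart.control(cp = 0.001))

# cptable columns include CP, nsplit, rel error, xerror, xstd
cp.table <- tree.full$cptable
best.cp  <- cp.table[which.min(cp.table[, "xerror"]), "CP"]

tree.best <- prune(tree.full, cp = best.cp)
```

A common refinement is the one-standard-error rule: choose the largest cp whose xerror is within one xstd of the minimum.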
The decision tree selected 4 variables as nodes: MMSCORE, HippoNV, FDG, and e4_1. Three of these variables (FDG, HippoNV, MMSCORE) are shared with the final regression model with 6 variables (FDG, AV45, HippoNV, rs744373, rs3764650, MMSCORE).
A decision tree may not be an optimal choice when applied to a dataset with a linear relationship between the predictors and the outcome variable. The decision boundary from a logistic regression model is linear, while the boundaries from a decision tree are parallel to the axes. This limitation prevents a decision tree from fully capturing a linear relationship in the data.
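A quick simulated check of this point (hypothetical data, not from the chapter): on a purely linear relationship, a regression tree's piecewise-constant fit trails linear regression even in-sample.

```r
library(rpart)

set.seed(1)
x <- runif(200)
y <- 2 + 3*x + rnorm(200, sd = 0.1)   # exactly linear plus noise
d <- data.frame(x, y)

mse <- function(pred) mean((d$y - pred)^2)
mse.lm   <- mse(predict(lm(y ~ x, data = d)))      # fits the line directly
mse.tree <- mse(predict(rpart(y ~ x, data = d)))   # step-function approximation

c(lm = mse.lm, tree = mse.tree)   # the tree's error is noticeably larger
```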
References: the textbook; PennState Eberly College of Science (https://onlinecourses.science.psu.edu/stat501); University of Virginia Library (http://data.library.virginia.edu/understanding-q-q-plots/).
Find two data sets from the UCI data repository or R datasets. Conduct a detailed regression analysis for both datasets using both regression model and the tree model (for regression), e.g., for regression model, you may want to conduct model selection, model comparison, testing of the significance of the regression parameters, evaluation of the R-squared and significance of the model. Also comment on the application of your model on the context of the dataset you have selected.
The first dataset we chose for modeling is the cu.summary dataset that contains automobile data taken from the April, 1990 issue of “Consumer Reports”. The dataset contains 117 observations and 5 features. We are trying to predict the price of the car. A table of feature descriptions is provided below.
| Feature Name | Description |
|---|---|
| Price | a numeric vector giving the list price in US dollars of a standard model |
| Country | Country of origin |
| Reliability | an ordered factor with levels ‘Much worse’ < ‘worse’ < ‘average’ < ‘better’ < ‘Much better’ |
| Mileage | fuel consumption miles per US gallon |
| Type | a factor with levels Compact Large Medium Small Sporty Van |
We fit a linear regression model with all of the features to predict the price of the car. Looking at the summary, we see that this linear regression model explains 83.7% of the variability in the data.
lm.car <- lm(Price~., data=df.car)
summary(lm.car)
##
## Call:
## lm(formula = Price ~ ., data = df.car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3078.1 -1048.3 3.8 920.1 5341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18992.2 3389.0 5.604 3.42e-06 ***
## CountryJapan 1213.8 2719.7 0.446 0.65838
## CountryJapan/USA -154.6 2756.3 -0.056 0.95562
## CountryKorea -2203.6 2962.6 -0.744 0.46243
## CountryMexico -3176.7 3298.4 -0.963 0.34272
## CountrySweden 5558.9 3164.0 1.757 0.08850 .
## CountryUSA -2625.2 2566.9 -1.023 0.31411
## Reliabilitybetter 1853.7 1460.1 1.270 0.21340
## ReliabilityMuch better -913.6 1452.3 -0.629 0.53377
## ReliabilityMuch worse -415.1 1267.3 -0.328 0.74538
## Reliabilityworse -822.7 1266.5 -0.650 0.52057
## Mileage -265.3 129.7 -2.044 0.04921 *
## TypeLarge 5140.8 1686.4 3.048 0.00459 **
## TypeMedium 5431.7 1070.7 5.073 1.61e-05 ***
## TypeSmall -2100.3 1346.9 -1.559 0.12875
## TypeSporty 894.6 1282.7 0.697 0.49057
## TypeVan 1723.1 1780.6 0.968 0.34045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2091 on 32 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.837, Adjusted R-squared: 0.7555
## F-statistic: 10.27 on 16 and 32 DF, p-value: 1.899e-08
We use the stepAIC function to identify and discard insignificant features from our model. Looking at the output, none of the features were discarded, showing that every feature in our initial linear regression contributes to the model; by the AIC criterion, the initial model is already the preferred linear regression model for these data.
Looking at the summary, we see that the feature levels associated with a higher price are: a medium or large vehicle size, "better" reliability, origin in Japan or Sweden, and a sporty or van body type. Thus three of the four features contain levels that push the price up, while the remaining coefficients, including Mileage, are negative.
lm.car <- stepAIC(lm.car, direction="both")
## Start: AIC=762.39
## Price ~ Country + Reliability + Mileage + Type
##
## Df Sum of Sq RSS AIC
## <none> 139957482 762.39
## - Reliability 4 27132444 167089926 763.07
## - Mileage 1 18280336 158237818 766.40
## - Country 6 80914089 220871570 772.74
## - Type 5 168527460 308484942 791.11
summary(lm.car)
##
## Call:
## lm(formula = Price ~ Country + Reliability + Mileage + Type,
## data = df.car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3078.1 -1048.3 3.8 920.1 5341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18992.2 3389.0 5.604 3.42e-06 ***
## CountryJapan 1213.8 2719.7 0.446 0.65838
## CountryJapan/USA -154.6 2756.3 -0.056 0.95562
## CountryKorea -2203.6 2962.6 -0.744 0.46243
## CountryMexico -3176.7 3298.4 -0.963 0.34272
## CountrySweden 5558.9 3164.0 1.757 0.08850 .
## CountryUSA -2625.2 2566.9 -1.023 0.31411
## Reliabilitybetter 1853.7 1460.1 1.270 0.21340
## ReliabilityMuch better -913.6 1452.3 -0.629 0.53377
## ReliabilityMuch worse -415.1 1267.3 -0.328 0.74538
## Reliabilityworse -822.7 1266.5 -0.650 0.52057
## Mileage -265.3 129.7 -2.044 0.04921 *
## TypeLarge 5140.8 1686.4 3.048 0.00459 **
## TypeMedium 5431.7 1070.7 5.073 1.61e-05 ***
## TypeSmall -2100.3 1346.9 -1.559 0.12875
## TypeSporty 894.6 1282.7 0.697 0.49057
## TypeVan 1723.1 1780.6 0.968 0.34045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2091 on 32 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.837, Adjusted R-squared: 0.7555
## F-statistic: 10.27 on 16 and 32 DF, p-value: 1.899e-08
We now evaluate our final linear regression model using the residuals-vs-fitted plot and the QQ plot. The residuals-vs-fitted plot shows randomly scattered points, which suggests the regression assumptions have not been violated. The QQ plot shows points clustered along the line, indicating that the residuals can plausibly be normally distributed.
par(mar = rep(2, 4)) # Change the margins
plot(lm.car, which=c(1))
plot(lm.car, which=c(2))
## Warning: not plotting observations with leverage one:
## 6, 12, 32
This model can be used to study which factors have the biggest impact on increasing the price of a vehicle. Understanding the extent to which certain features inflate the price can help manufacturers decide which areas to focus on when attempting to decrease their prices.
A decision tree fitted to our dataset is shown below. The tree incorporates only two of the four features, the type and the country of origin of each vehicle, while it discards the information pertaining to reliability and mileage. In contrast, the linear regression model retained all four features. This discrepancy between the models suggests that each model may better explain certain characteristics of the data.
library("rpart.plot")
tr.car <- rpart(Price~., data=df.car)
prp(tr.car, varlen=3)
The second dataset we chose for modeling is the Males dataset, which records the wages and education of young males. The dataset contains 4360 observations and 12 features. We are trying to predict the (log) hourly wage of each person. A table of feature descriptions is provided below.
| Feature Name | Description |
|---|---|
| nr | unique identifier |
| year | year data was collected |
| school | years of schooling |
| exper | years of experience |
| union | wage set by collective bargaining? |
| ethn | ethnicity |
| married | are they married? |
| health | health problems? |
| wage | log of hourly wage |
| industry | work industry |
| occupation | job occupation |
| residence | area of residence |
We first fit an initial linear regression model with all of the features to predict the wage of each individual. Looking at the summary, the initial linear regression model explains 28.54% of the variability in wage.
lm.males <- lm(wage~., data=df.males)
summary(lm.males)
##
## Call:
## lm(formula = wage ~ ., data = df.males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1929 -0.2273 0.0220 0.2561 2.2798
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -5.833e+01 1.487e+01
## nr -6.947e-06 2.662e-06
## year 2.956e-02 7.531e-03
## school 7.606e-02 6.392e-03
## exper 1.747e-02 6.667e-03
## unionyes 1.786e-01 2.030e-02
## ethnhisp 9.737e-02 3.459e-02
## ethnother 1.136e-01 2.621e-02
## marriedyes 8.674e-02 1.789e-02
## healthyes -5.440e-02 6.555e-02
## industryBusiness_and_Repair_Service 2.331e-01 6.763e-02
## industryConstruction 3.289e-01 6.845e-02
## industryEntertainment -1.460e-01 9.023e-02
## industryFinance 4.802e-01 7.562e-02
## industryManufacturing 3.794e-01 6.250e-02
## industryMining 4.424e-01 9.611e-02
## industryPersonal_Service 1.970e-01 8.613e-02
## industryProfessional_and_Related Service 6.675e-02 6.815e-02
## industryPublic_Administration 2.886e-01 7.313e-02
## industryTrade 1.460e-01 6.287e-02
## industryTransportation 4.441e-01 6.743e-02
## occupationCraftsmen, Foremen_and_kindred 5.664e-02 3.163e-02
## occupationFarm_Laborers_and_Foreman 8.437e-03 9.345e-02
## occupationLaborers_and_farmers -3.519e-02 3.710e-02
## occupationManagers, Officials_and_Proprietors 1.448e-01 3.613e-02
## occupationOperatives_and_kindred -3.784e-02 3.229e-02
## occupationProfessional, Technical_and_kindred 1.723e-01 3.583e-02
## occupationSales_Workers 6.680e-02 4.391e-02
## occupationService_Workers -5.467e-02 3.419e-02
## residencenothern_central -1.455e-01 2.286e-02
## residencerural_area -1.038e-01 5.429e-02
## residencesouth -1.243e-01 2.172e-02
## t value Pr(>|t|)
## (Intercept) -3.921 9.00e-05 ***
## nr -2.610 0.009098 **
## year 3.925 8.86e-05 ***
## school 11.899 < 2e-16 ***
## exper 2.621 0.008811 **
## unionyes 8.795 < 2e-16 ***
## ethnhisp 2.815 0.004911 **
## ethnother 4.335 1.50e-05 ***
## marriedyes 4.848 1.31e-06 ***
## healthyes -0.830 0.406670
## industryBusiness_and_Repair_Service 3.447 0.000575 ***
## industryConstruction 4.806 1.62e-06 ***
## industryEntertainment -1.618 0.105733
## industryFinance 6.350 2.46e-10 ***
## industryManufacturing 6.069 1.44e-09 ***
## industryMining 4.603 4.34e-06 ***
## industryPersonal_Service 2.287 0.022237 *
## industryProfessional_and_Related Service 0.979 0.327446
## industryPublic_Administration 3.946 8.11e-05 ***
## industryTrade 2.323 0.020244 *
## industryTransportation 6.587 5.26e-11 ***
## occupationCraftsmen, Foremen_and_kindred 1.790 0.073483 .
## occupationFarm_Laborers_and_Foreman 0.090 0.928061
## occupationLaborers_and_farmers -0.948 0.342974
## occupationManagers, Officials_and_Proprietors 4.008 6.27e-05 ***
## occupationOperatives_and_kindred -1.172 0.241450
## occupationProfessional, Technical_and_kindred 4.808 1.60e-06 ***
## occupationSales_Workers 1.521 0.128310
## occupationService_Workers -1.599 0.109933
## residencenothern_central -6.366 2.22e-10 ***
## residencerural_area -1.912 0.055923 .
## residencesouth -5.723 1.15e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4523 on 3083 degrees of freedom
## (1245 observations deleted due to missingness)
## Multiple R-squared: 0.2854, Adjusted R-squared: 0.2782
## F-statistic: 39.72 on 31 and 3083 DF, p-value: < 2.2e-16
Next we performed feature selection using stepwise regression to fit a better model. The final model produced by the stepwise regression kept every feature except health. Comparing the two models, the R-squared was only reduced to 28.53%. Looking at the coefficients, years of schooling, years of experience, union membership, ethnicity, and being married all have a positive impact on wage per unit increase, holding all other features constant. The other coefficients are a mix of positive and negative values.
lm.males <- stepAIC(lm.males, direction="both")
## Start: AIC=-4910.9
## wage ~ nr + year + school + exper + union + ethn + married +
## health + industry + occupation + residence
##
## Df Sum of Sq RSS AIC
## - health 1 0.141 630.89 -4912.2
## <none> 630.75 -4910.9
## - nr 1 1.394 632.14 -4906.0
## - exper 1 1.405 632.15 -4906.0
## - year 1 3.152 633.90 -4897.4
## - ethn 2 3.879 634.63 -4895.8
## - married 1 4.809 635.56 -4889.2
## - residence 3 9.557 640.30 -4870.1
## - occupation 8 15.155 645.90 -4852.9
## - union 1 15.827 646.57 -4835.7
## - school 1 28.968 659.71 -4773.0
## - industry 11 50.919 681.67 -4691.1
##
## Step: AIC=-4912.21
## wage ~ nr + year + school + exper + union + ethn + married +
## industry + occupation + residence
##
## Df Sum of Sq RSS AIC
## <none> 630.89 -4912.2
## + health 1 0.141 630.75 -4910.9
## - nr 1 1.389 632.28 -4907.4
## - exper 1 1.417 632.30 -4907.2
## - year 1 3.155 634.04 -4898.7
## - ethn 2 3.862 634.75 -4897.2
## - married 1 4.785 635.67 -4890.7
## - residence 3 9.567 640.45 -4871.3
## - occupation 8 15.338 646.23 -4853.4
## - union 1 16.019 646.91 -4836.1
## - school 1 29.104 659.99 -4773.7
## - industry 11 50.937 681.82 -4692.3
summary(lm.males)
##
## Call:
## lm(formula = wage ~ nr + year + school + exper + union + ethn +
## married + industry + occupation + residence, data = df.males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1925 -0.2273 0.0224 0.2568 2.2815
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -5.836e+01 1.487e+01
## nr -6.937e-06 2.662e-06
## year 2.957e-02 7.531e-03
## school 7.621e-02 6.389e-03
## exper 1.754e-02 6.666e-03
## unionyes 1.794e-01 2.028e-02
## ethnhisp 9.749e-02 3.459e-02
## ethnother 1.133e-01 2.620e-02
## marriedyes 8.652e-02 1.789e-02
## industryBusiness_and_Repair_Service 2.349e-01 6.760e-02
## industryConstruction 3.316e-01 6.837e-02
## industryEntertainment -1.451e-01 9.022e-02
## industryFinance 4.822e-01 7.558e-02
## industryManufacturing 3.814e-01 6.245e-02
## industryMining 4.450e-01 9.605e-02
## industryPersonal_Service 1.984e-01 8.611e-02
## industryProfessional_and_Related Service 6.897e-02 6.810e-02
## industryPublic_Administration 2.905e-01 7.309e-02
## industryTrade 1.485e-01 6.279e-02
## industryTransportation 4.458e-01 6.739e-02
## occupationCraftsmen, Foremen_and_kindred 5.734e-02 3.162e-02
## occupationFarm_Laborers_and_Foreman 1.046e-02 9.341e-02
## occupationLaborers_and_farmers -3.476e-02 3.710e-02
## occupationManagers, Officials_and_Proprietors 1.458e-01 3.611e-02
## occupationOperatives_and_kindred -3.779e-02 3.229e-02
## occupationProfessional, Technical_and_kindred 1.732e-01 3.580e-02
## occupationSales_Workers 6.662e-02 4.391e-02
## occupationService_Workers -5.474e-02 3.419e-02
## residencenothern_central -1.457e-01 2.285e-02
## residencerural_area -1.034e-01 5.429e-02
## residencesouth -1.242e-01 2.172e-02
## t value Pr(>|t|)
## (Intercept) -3.923 8.92e-05 ***
## nr -2.606 0.009199 **
## year 3.927 8.79e-05 ***
## school 11.928 < 2e-16 ***
## exper 2.632 0.008537 **
## unionyes 8.849 < 2e-16 ***
## ethnhisp 2.819 0.004855 **
## ethnother 4.325 1.57e-05 ***
## marriedyes 4.837 1.39e-06 ***
## industryBusiness_and_Repair_Service 3.475 0.000518 ***
## industryConstruction 4.849 1.30e-06 ***
## industryEntertainment -1.609 0.107776
## industryFinance 6.381 2.03e-10 ***
## industryManufacturing 6.108 1.14e-09 ***
## industryMining 4.633 3.76e-06 ***
## industryPersonal_Service 2.304 0.021311 *
## industryProfessional_and_Related Service 1.013 0.311247
## industryPublic_Administration 3.975 7.20e-05 ***
## industryTrade 2.365 0.018088 *
## industryTransportation 6.615 4.37e-11 ***
## occupationCraftsmen, Foremen_and_kindred 1.813 0.069878 .
## occupationFarm_Laborers_and_Foreman 0.112 0.910864
## occupationLaborers_and_farmers -0.937 0.348912
## occupationManagers, Officials_and_Proprietors 4.038 5.53e-05 ***
## occupationOperatives_and_kindred -1.170 0.242006
## occupationProfessional, Technical_and_kindred 4.839 1.37e-06 ***
## occupationSales_Workers 1.517 0.129320
## occupationService_Workers -1.601 0.109440
## residencenothern_central -6.375 2.10e-10 ***
## residencerural_area -1.905 0.056834 .
## residencesouth -5.719 1.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4523 on 3084 degrees of freedom
## (1245 observations deleted due to missingness)
## Multiple R-squared: 0.2853, Adjusted R-squared: 0.2783
## F-statistic: 41.03 on 30 and 3084 DF, p-value: < 2.2e-16
After obtaining the final model, we analyze the residuals-vs-fitted plot and the QQ plot to check whether any of the linear regression assumptions are violated. In the residuals-vs-fitted plot the points appear randomly distributed, with no trend as the fitted values increase, so there is no evident violation. The QQ plot, however, shows some deviation from the normality assumption.
plot(lm.males, which=c(1))
plot(lm.males, which=c(2))
This model could be used to study which features are most strongly correlated with a high wage. This could be important for deciding policy on whether more education helps increase wages, and for studying whether there are discrepancies among different socioeconomic factors.
A decision tree was fit to the data and the output of the model is shown below. Compared to the regression model, the decision tree incorporated only 5 of the 11 features: industry, years of schooling, year the data was collected, marital status, and years of work experience. From analyzing the rules the decision tree produced, the observations with the highest wage are those with 12+ years of schooling. This type of model can be used to see whether there are underlying patterns in the data that a linear regression might not be able to ascertain.
tr.males <- rpart(wage~.,data=df.males)
prp(tr.males, nn.cex=1)
Build a decision tree model based on the following dataset. Don’t use R. Use your pen and paper, and show the process.
df.q5 <- data.frame(x1=c(0.22,0.58,0.57,0.41,0.60,0.12,0.25,0.32),
x2=c(0.38,0.32,0.28,0.43,0.29,0.32,0.32,0.38),
y=factor(c("No","Yes","Yes","Yes","No","Yes","Yes","No")))
We chose an arbitrary rule of splitting at the quartiles. The first split is on x2 at its 3rd quartile, \(x2 \leq 0.38\). The left node contains data points (1,2,3,5,6,7,8) and the right node contains data point (4).
\[e_{root} = -\frac{5}{8} \log_{2}\frac{5}{8} - \frac{3}{8} \log_{2}\frac{3}{8} = 0.9544\] \[e_{x2 \leq 0.38} = -\frac{4}{7} \log_{2}\frac{4}{7} - \frac{3}{7} \log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \gt 0.38} = -\frac{1}{1} \log_{2}\frac{1}{1} - \frac{0}{1} \log_{2}\frac{0}{1} = 0\] \[IG = e_{root} - \frac{7}{8}\times e_{x2 \leq 0.38}-\frac{1}{8}\times e_{x2 \gt 0.38} = 0.0924\]
The second split is applied to the left node containing data points (1,2,3,5,6,7,8), again on x2 at its 3rd quartile, \(x2 \leq 0.35\). Its left child contains data points (2,3,5,6,7) and its right child contains data points (1,8).
\[e_{root} = -\frac{4}{7} \log_{2}\frac{4}{7} - \frac{3}{7} \log_{2}\frac{3}{7} = 0.9852\] \[e_{x2 \leq 0.35} = -\frac{4}{5} \log_{2}\frac{4}{5} - \frac{1}{5} \log_{2}\frac{1}{5} = 0.7219\] \[e_{x2 \gt 0.35} = -\frac{0}{2} \log_{2}\frac{0}{2} - \frac{2}{2} \log_{2}\frac{2}{2} = 0\] \[IG = e_{root} - \frac{5}{7}\times e_{x2 \leq 0.35}-\frac{2}{7}\times e_{x2 \gt 0.35} = 0.4696\]
The third split is applied to the left node containing data points (2,3,5,6,7), on x1 at its 3rd quartile, \(x1 \leq 0.58\). Its left child contains data points (2,3,6,7) and its right child contains data point (5).
\[e_{root} = -\frac{4}{5} \log_{2}\frac{4}{5} - \frac{1}{5} \log_{2}\frac{1}{5} = 0.7219\] \[e_{x1 \leq 0.58} = -\frac{4}{4} \log_{2}\frac{4}{4} - \frac{0}{4} \log_{2}\frac{0}{4} = 0\] \[e_{x1 \gt 0.58} = -\frac{0}{1} \log_{2}\frac{0}{1} - \frac{1}{1} \log_{2}\frac{1}{1} = 0\] \[IG = e_{root} - \frac{4}{5}\times e_{x1 \leq 0.58}-\frac{1}{5}\times e_{x1 \gt 0.58} = 0.7219\]
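The hand computations above can be checked with a short helper (a sketch; entropy and info_gain are our own functions, not from a package):

```r
# Entropy (base 2) of a vector of class counts; 0*log2(0) is taken as 0
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting a parent node into two children,
# each given as a vector of class counts c(#Yes, #No)
info_gain <- function(parent, left, right) {
  n <- sum(parent)
  entropy(parent) - sum(left)/n * entropy(left) - sum(right)/n * entropy(right)
}

round(info_gain(c(5, 3), c(4, 3), c(1, 0)), 4)   # first split  -> 0.0924
round(info_gain(c(4, 3), c(4, 1), c(0, 2)), 4)   # second split -> 0.4696
round(info_gain(c(4, 1), c(4, 0), c(0, 1)), 4)   # third split  -> 0.7219
```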
Write your own R script to implement the least squares estimation of a regression model. Compare the output from your script with the output from lm().
We arbitrarily choose the number of input variables as p = 3, and define the following function.
#least_squares: Returns the least squares estimates of a regression model
least_squares <- function(x1, x2, x3, y){
# Create the design matrix X (with an intercept column) and the response Y
x_matrix = as.matrix(cbind(rep(1, length(y)), x1, x2, x3))
y_matrix = as.matrix(y)
# Compute the estimator: beta-hat = (X'X)^(-1) X'y
beta_estimator = solve( t(x_matrix) %*% x_matrix ) %*% t(x_matrix) %*% y_matrix
return(beta_estimator)
}
We can now compare our function against the output from lm()
# Choose random x1, x2, x3 inputs
x1 = runif(20)
x2 = runif(20)
x3 = runif(20)
# Create an arbitrary y
y = 4 + 2*x1 + 1*x2 + 8*x3
# Call our function: least_squares
least_squares(x1, x2, x3, y)
## [,1]
## 4
## x1 2
## x2 1
## x3 8
# Calling the R function for linear regression: lm
lm(y~., data = data.frame(x1, x2, x3, y))
##
## Call:
## lm(formula = y ~ ., data = data.frame(x1, x2, x3, y))
##
## Coefficients:
## (Intercept) x1 x2 x3
## 4 2 1 8